Paired t-test (ttest_rel) — comparing means with paired data#
The paired t-test (also called the dependent-samples t-test) answers a very specific question:
Do two measurements taken on the same units differ on average?
Examples:
Before vs after an intervention on the same people
Two measurement methods applied to the same items
Matched pairs (e.g., twins, matched customers, matched locations)
The key idea is that pairing lets you remove unit-to-unit variability by working with within-pair differences.
What you’ll learn#
when a paired t-test is appropriate (and when it isn’t)
how it reduces to a one-sample t-test on differences
what the t statistic measures and how to interpret the p-value
a NumPy-only implementation (including Monte Carlo p-values / critical values)
Plotly visuals to build intuition: paired lines, difference distribution, null distribution, t distribution
import math
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
try:
import scipy
from scipy import stats
except Exception: # pragma: no cover
scipy = None
stats = None
# Plotly rendering (CKC convention)
pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
# Reproducibility
rng = np.random.default_rng(7)
np.set_printoptions(precision=6, suppress=True)
print("numpy:", np.__version__)
if scipy is not None:
print("scipy:", scipy.__version__)
numpy: 1.26.2
scipy: 1.15.0
1) What the paired t-test tests#
Suppose you observe paired measurements \((x_i, y_i)\) for \(i=1,\dots,n\). Define the within-pair differences

\[
d_i = y_i - x_i, \qquad i = 1, \dots, n.
\]
The paired t-test is simply a one-sample t-test applied to the \(d_i\)’s.
Parameter of interest: the mean difference \(\mu_d = \mathbb{E}[d]\)
Null hypothesis: \(H_0: \mu_d = \mu_0\) (usually \(\mu_0=0\))
Alternative:
two-sided: \(H_1: \mu_d \neq \mu_0\)
one-sided: \(H_1: \mu_d > \mu_0\) or \(H_1: \mu_d < \mu_0\)
Intuition: if there is no systematic change, the differences should fluctuate around 0.
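To see the reduction concretely, here is a minimal sketch (assuming SciPy is available; the toy data below are made up for illustration) showing that ttest_rel on the pairs matches ttest_1samp on the differences:
import numpy as np
from scipy import stats
rng_demo = np.random.default_rng(0)
x = rng_demo.normal(loc=50, scale=10, size=20)
y = x + rng_demo.normal(loc=1.0, scale=3.0, size=20)
# Paired t-test on (x, y) equals a one-sample t-test on d = y - x
print(stats.ttest_rel(y, x))
print(stats.ttest_1samp(y - x, popmean=0.0))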
2) When to use it (and when not)#
Use a paired t-test when:
You have two measurements per unit (same person/item/place twice), or matched pairs.
You care about the average within-unit change.
Different pairs are independent of each other.
Do not use a paired t-test when:
The samples are independent (different people in each group) → use an independent t-test instead.
You have more than two time points per unit → consider repeated-measures ANOVA / mixed models.
Pairing is unclear or mismatched (if you pair the wrong units, you create noise and bias).
Why pairing can help (variance reduction)#
A paired design often has positive correlation between \(x_i\) and \(y_i\): the same unit tends to be “high” or “low” both times. Pairing exploits that, because

\[
\mathrm{Var}(Y - X) = \mathrm{Var}(X) + \mathrm{Var}(Y) - 2\,\mathrm{Cov}(X, Y).
\]

If \(\mathrm{Cov}(X,Y) > 0\), the variance of the difference is smaller than under independence, so the test can be more powerful.
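A quick simulation (a sketch, with an assumed within-pair correlation of 0.8) makes the variance reduction visible:
import numpy as np
rng_var = np.random.default_rng(0)
# Var(X) = Var(Y) = 100, Cov(X, Y) = 80 (correlation 0.8)
cov = [[100.0, 80.0], [80.0, 100.0]]
xy = rng_var.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=100_000)
x_sim, y_sim = xy[:, 0], xy[:, 1]
print("Var(Y - X):     ", np.var(y_sim - x_sim))               # about 100 + 100 - 2*80 = 40
print("Sum of variances:", np.var(x_sim) + np.var(y_sim))      # about 200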
3) Assumptions (what must be true for the classic p-value)#
Let \(d_i = y_i - x_i\).
Independence of pairs: \((d_1,\dots,d_n)\) are independent across units.
Approximate normality of differences: \(d_i\) are approximately normal (or \(n\) is large enough for the CLT to kick in).
No extreme outliers in differences: outliers can dominate \(\bar d\) and \(s_d\).
Notes:
The test is fairly robust to mild non-normality, especially as \(n\) grows, but it is not robust to strong outliers.
If the difference distribution is very non-normal / heavy-tailed, consider a Wilcoxon signed-rank test or a randomization/sign-flip test.
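As a sketch (assuming SciPy is installed), the Wilcoxon signed-rank test takes the differences directly; the vector below is hypothetical:
import numpy as np
from scipy import stats
d_toy = np.array([1.2, -0.5, 2.1, 0.3, 1.8, -0.2, 0.9, 1.1])  # hypothetical differences
# Tests whether the differences are symmetrically distributed around 0
print(stats.wilcoxon(d_toy))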
4) The test statistic (what you actually compute)#
Compute:
Sample mean difference: \(\bar d = \frac{1}{n}\sum_{i=1}^n d_i\)
Sample standard deviation of differences: \(s_d = \sqrt{\frac{1}{n-1}\sum_{i=1}^n (d_i-\bar d)^2}\)
Standard error of the mean difference: \(\mathrm{SE}(\bar d) = \frac{s_d}{\sqrt{n}}\)
The paired t-statistic is

\[
t = \frac{\bar d - \mu_0}{s_d / \sqrt{n}}.
\]

Under \(H_0\) and the assumptions, \(t\) follows a Student t distribution with \(n-1\) degrees of freedom.
Interpretation of \(t\):
\(t\) is “how many standard errors away from \(\mu_0\) your sample mean difference is”.
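As a sketch with a tiny hypothetical difference vector, the whole computation is a few lines:
import numpy as np
d = np.array([2.0, 1.0, 3.0, -1.0, 2.5])  # hypothetical within-pair differences
n_toy = d.size
dbar = d.mean()                  # sample mean difference
sd = d.std(ddof=1)               # sample SD (n - 1 denominator)
se = sd / np.sqrt(n_toy)         # standard error of the mean difference
t = (dbar - 0.0) / se            # t-statistic against mu0 = 0
print("df:", n_toy - 1, " dbar:", dbar, " t:", t)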
5) A small worked example (with visuals)#
We’ll simulate a before/after dataset with a modest average increase.
n = 30
before = rng.normal(loc=50, scale=10, size=n)
true_effect = 2.5
after = before + true_effect + rng.normal(loc=0, scale=4, size=n)
diff = after - before
print("n:", n)
print("mean(before):", before.mean())
print("mean(after): ", after.mean())
print("mean(diff): ", diff.mean())
print("std(diff): ", diff.std(ddof=1))
n: 30
mean(before): 45.85930307250626
mean(after): 48.340805958609465
mean(diff): 2.4815028861032147
std(diff): 3.628114155415627
# Plot 1: paired lines (each line = one unit)
xs = np.tile([0, 1, np.nan], n)
ys = np.empty(3 * n)
ys[0::3] = before
ys[1::3] = after
ys[2::3] = np.nan
fig = go.Figure()
fig.add_trace(
go.Scatter(
x=xs,
y=ys,
mode="lines+markers",
line=dict(color="rgba(0,0,0,0.25)", width=1),
marker=dict(size=6, color="rgba(0,0,0,0.35)"),
showlegend=False,
hoverinfo="skip",
)
)
fig.add_trace(
go.Scatter(
x=[0, 1],
y=[before.mean(), after.mean()],
mode="lines+markers",
line=dict(color="#1f77b4", width=4),
marker=dict(size=11, color="#1f77b4"),
name="Mean",
)
)
fig.update_layout(
title="Paired measurements (each line is one paired unit)",
xaxis=dict(tickmode="array", tickvals=[0, 1], ticktext=["Before", "After"]),
yaxis_title="Measurement",
width=900,
height=450,
)
fig.add_annotation(
x=0.5,
y=max(before.max(), after.max()),
text=f"Mean change = {diff.mean():.2f}",
showarrow=False,
yshift=15,
)
fig
# Plot 2: before vs after (diagonal = 'no change')
lo = float(min(before.min(), after.min()))
hi = float(max(before.max(), after.max()))
fig = go.Figure()
fig.add_trace(
go.Scatter(
x=before,
y=after,
mode="markers",
marker=dict(size=10, opacity=0.75, color="#636EFA"),
name="Pairs",
)
)
fig.add_trace(
go.Scatter(
x=[lo, hi],
y=[lo, hi],
mode="lines",
line=dict(color="black", dash="dash"),
name="No change (y=x)",
)
)
fig.update_layout(
title="Before vs After (points above diagonal are increases)",
xaxis_title="Before",
yaxis_title="After",
width=650,
height=550,
)
fig
# Plot 3: distribution of differences
fig = go.Figure()
fig.add_trace(
go.Histogram(
x=diff,
nbinsx=18,
marker_color="#00A6D6",
opacity=0.8,
)
)
fig.add_vline(
x=0,
line_dash="dash",
line_color="black",
annotation_text="0 (no change)",
annotation_position="top left",
)
fig.add_vline(
x=float(diff.mean()),
line_color="#1f77b4",
line_width=3,
annotation_text=f"mean = {diff.mean():.2f}",
annotation_position="top right",
)
fig.update_layout(
title="Distribution of within-pair differences (after − before)",
xaxis_title="difference",
yaxis_title="count",
width=900,
height=450,
)
fig
6) Interpreting the paired t-test#
The paired t-test produces a t statistic and a p-value.
The t statistic measures how far \(\bar d\) is from \(\mu_0\) in units of its estimated standard error.
The p-value is:
Assuming \(H_0\) is true, the probability of seeing a t-statistic at least as extreme as the one we observed.
What the p-value does not mean:
It is not the probability that \(H_0\) is true.
It does not tell you the size of the effect.
A good paired t-test report typically includes:
\(\bar d\) (mean change)
a confidence interval for \(\mu_d\)
the t statistic and degrees of freedom
the p-value
an effect size (e.g., Cohen’s \(d_z = \bar d/s_d\))
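As a sketch of such a report (assuming SciPy is available, and reusing before, after, diff, and n from the simulation above; the sentence template is just one option):
from scipy import stats
t_stat, p_val = stats.ttest_rel(after, before)
df_rep = n - 1
dbar, sd = diff.mean(), diff.std(ddof=1)
lo, hi = stats.t.interval(0.95, df=df_rep, loc=dbar, scale=sd / math.sqrt(n))
print(f"Mean change = {dbar:.2f}, 95% CI [{lo:.2f}, {hi:.2f}], "
      f"t({df_rep}) = {t_stat:.2f}, p = {p_val:.4f}, d_z = {dbar / sd:.2f}")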
7) NumPy-only implementation (from scratch)#
Below is a clean “low-level” implementation:
compute \(\bar d\), \(s_d\), \(\mathrm{SE}\), \(t\) directly
approximate p-values and critical values using Monte Carlo samples from the \(t\) distribution
This is intentionally explicit so you can see the mechanics.
In practice you would use SciPy for an exact CDF/PPF-based p-value.
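For reference, the exact two-sided p-value uses the t survival function instead of sampling; a one-line sketch with SciPy (the t and df values below are illustrative placeholders):
from scipy import stats
t_obs_demo, df_demo = 3.75, 29                       # illustrative values
p_exact = 2 * stats.t.sf(abs(t_obs_demo), df_demo)   # two-sided tail probability
print(p_exact)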
def _paired_clean(x: np.ndarray, y: np.ndarray) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
x = np.asarray(x, dtype=float)
y = np.asarray(y, dtype=float)
if x.shape != y.shape:
raise ValueError(f"x and y must have the same shape; got {x.shape} vs {y.shape}")
mask = ~np.isnan(x) & ~np.isnan(y)
x = x[mask]
y = y[mask]
if x.size < 2:
raise ValueError("Need at least 2 non-NaN pairs")
d = y - x
return x, y, d
def paired_t_statistic(x: np.ndarray, y: np.ndarray, mu0: float = 0.0) -> dict:
'''Compute paired t-statistic ingredients (no CDF/PPF).'''
_, _, d = _paired_clean(x, y)
n = int(d.size)
df = n - 1
dbar = float(d.mean())
sd = float(d.std(ddof=1))
se = sd / math.sqrt(n)
if se == 0.0:
t = math.inf if dbar > mu0 else (-math.inf if dbar < mu0 else 0.0)
else:
t = (dbar - float(mu0)) / se
dz = (dbar - float(mu0)) / sd if sd > 0 else math.inf
return {
"n": n,
"df": df,
"mean_diff": dbar,
"sd_diff": sd,
"se_mean_diff": se,
"t": float(t),
"cohens_dz": float(dz),
}
def simulate_student_t(df: int, size: int, rng: np.random.Generator) -> np.ndarray:
'''Sample from Student's t using only NumPy RNG primitives.
If Z ~ N(0,1) and V ~ ChiSquare(df) independent, then
T = Z / sqrt(V/df) ~ t_df
'''
if df <= 0:
raise ValueError("df must be >= 1")
z = rng.standard_normal(size=size)
v = rng.chisquare(df=df, size=size)
return z / np.sqrt(v / df)
def paired_t_test_numpy(
x: np.ndarray,
y: np.ndarray,
*,
mu0: float = 0.0,
alternative: str = "two-sided",
alpha: float = 0.05,
n_mc: int = 300_000,
seed: int = 7,
) -> dict:
'''Paired t-test with Monte Carlo p-value + CI (NumPy only).
alternative: 'two-sided' | 'greater' | 'less'
'''
alt = alternative.lower().strip()
if alt not in {"two-sided", "greater", "less"}:
raise ValueError("alternative must be 'two-sided', 'greater', or 'less'")
if not (0 < alpha < 1):
raise ValueError("alpha must be in (0,1)")
base = paired_t_statistic(x, y, mu0=mu0)
t_obs = float(base["t"])
df = int(base["df"])
rng_local = np.random.default_rng(seed)
t_null = simulate_student_t(df=df, size=int(n_mc), rng=rng_local)
if alt == "two-sided":
p = float(np.mean(np.abs(t_null) >= abs(t_obs)))
tcrit = float(np.quantile(t_null, 1 - alpha / 2))
ci = (
base["mean_diff"] - tcrit * base["se_mean_diff"],
base["mean_diff"] + tcrit * base["se_mean_diff"],
)
elif alt == "greater":
p = float(np.mean(t_null >= t_obs))
tcrit = float(np.quantile(t_null, 1 - alpha))
ci = (base["mean_diff"] - tcrit * base["se_mean_diff"], math.inf)
else: # alt == 'less'
p = float(np.mean(t_null <= t_obs))
tcrit = float(np.quantile(t_null, alpha)) # negative
ci = (-math.inf, base["mean_diff"] - tcrit * base["se_mean_diff"])
return {
**base,
"mu0": float(mu0),
"alternative": alt,
"alpha": float(alpha),
"p_value_mc": p,
"ci_mc": (float(ci[0]), float(ci[1])),
"n_mc": int(n_mc),
"seed": int(seed),
}
res = paired_t_test_numpy(before, after, alternative="two-sided", alpha=0.05, n_mc=400_000, seed=1)
res
{'n': 30,
'df': 29,
'mean_diff': 2.4815028861032147,
'sd_diff': 3.628114155415627,
'se_mean_diff': 0.6623999880416477,
't': 3.7462302700814556,
'cohens_dz': 0.6839649415107614,
'mu0': 0.0,
'alternative': 'two-sided',
'alpha': 0.05,
'p_value_mc': 0.0009,
'ci_mc': (1.1329486494524064, 3.830057122754023),
'n_mc': 400000,
'seed': 1}
Reading the output#
mean_diff is \(\bar d\) (your estimated average change).
t is the t-statistic.
p_value_mc is a Monte Carlo approximation to the classical p-value.
ci_mc is a Monte Carlo-based confidence interval for the mean difference.
cohens_dz is an effect size in SD units of the differences.
If the CI for \(\mu_d\) does not include 0, the two-sided test will have p-value < \(\alpha\) (up to Monte Carlo noise).
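A quick check of that duality on the result above (a small sketch using the res dict):
ci_lo, ci_hi = res["ci_mc"]
print("CI excludes 0:     ", not (ci_lo <= 0.0 <= ci_hi))
print("p_value_mc < alpha:", res["p_value_mc"] < res["alpha"])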
# Visual: t distribution with the observed t statistic
def student_t_pdf(t: np.ndarray, df: int) -> np.ndarray:
t = np.asarray(t, dtype=float)
v = float(df)
c = math.gamma((v + 1) / 2) / (math.sqrt(v * math.pi) * math.gamma(v / 2))
return c * (1 + (t**2) / v) ** (-(v + 1) / 2)
df = int(res["df"])
t_obs = float(res["t"])
grid = np.linspace(-6, 6, 1200)
pdf = student_t_pdf(grid, df=df)
fig = go.Figure()
fig.add_trace(go.Scatter(x=grid, y=pdf, mode="lines", line=dict(color="#636EFA"), name=f"t(df={df})"))
thr = abs(t_obs)
left = grid <= -thr
right = grid >= thr
fig.add_trace(
go.Scatter(
x=grid[left],
y=pdf[left],
mode="lines",
fill="tozeroy",
line=dict(color="rgba(214,39,40,0.1)"),
fillcolor="rgba(214,39,40,0.25)",
showlegend=False,
)
)
fig.add_trace(
go.Scatter(
x=grid[right],
y=pdf[right],
mode="lines",
fill="tozeroy",
line=dict(color="rgba(214,39,40,0.1)"),
fillcolor="rgba(214,39,40,0.25)",
showlegend=False,
)
)
fig.add_vline(x=t_obs, line_width=3, line_color="black")
fig.add_vline(x=-t_obs, line_width=3, line_color="black", line_dash="dot")
fig.update_layout(
title=f"t distribution under H0 (df={df}) with observed t = {t_obs:.3f} (p≈{res['p_value_mc']:.4f})",
xaxis_title="t",
yaxis_title="density",
width=900,
height=450,
)
fig
8) A null distribution intuition (sign-flip / randomization)#
Under the paired null \(H_0: \mu_d = 0\), a common non-parametric idea is:
if there is truly no systematic direction of change, then each difference \(d_i\) is just as likely to be positive as negative
A sign-flip test builds a null distribution by randomly multiplying each \(d_i\) by \(+1\) or \(-1\) and recomputing the mean.
This is not the same as the t-test, but it is a great intuition-builder:
it shows what “extreme” means
it highlights how the p-value is a tail probability under a null model
B = 50_000
signs = rng.choice([-1.0, 1.0], size=(B, n))
mean_null = (signs * diff).mean(axis=1)
obs = float(diff.mean())
p_signflip = float(np.mean(np.abs(mean_null) >= abs(obs)))
fig = go.Figure()
fig.add_trace(
go.Histogram(
x=mean_null,
nbinsx=80,
marker_color="#AB63FA",
opacity=0.8,
)
)
fig.add_vline(
x=0,
line_dash="dash",
line_color="black",
annotation_text="0",
annotation_position="top left",
)
fig.add_vline(
x=obs,
line_color="black",
line_width=3,
annotation_text=f"observed mean diff = {obs:.2f}",
annotation_position="top right",
)
fig.update_layout(
title=f"Sign-flip null distribution of the mean difference (p≈{p_signflip:.4f})",
xaxis_title="mean difference under H0",
yaxis_title="count",
width=900,
height=450,
)
fig
9) Practical usage (SciPy) + sanity check#
The canonical implementation is scipy.stats.ttest_rel.
If SciPy is available, we can compare results to ensure our t-statistic matches and the p-value is close (Monte Carlo p-values will differ slightly because they are approximations).
if stats is None:
print("SciPy not available; skipping this section.")
else:
# SciPy's ttest_rel uses y and x as paired samples.
try:
scipy_res = stats.ttest_rel(after, before, alternative="two-sided")
except TypeError:
# Older SciPy may not support `alternative=`
scipy_res = stats.ttest_rel(after, before)
dbar = diff.mean()
sd = diff.std(ddof=1)
se = sd / math.sqrt(n)
df = n - 1
ci_scipy = stats.t.interval(0.95, df=df, loc=dbar, scale=se)
print("SciPy t:", float(scipy_res.statistic))
print("SciPy p:", float(scipy_res.pvalue))
print("SciPy 95% CI for mean diff:", tuple(map(float, ci_scipy)))
print()
print("Our Monte Carlo estimate:")
print("t:", res["t"])
print("p_value_mc:", res["p_value_mc"])
print("ci_mc:", res["ci_mc"])
SciPy t: 3.7462302700814556
SciPy p: 0.0007931638812911782
SciPy 95% CI for mean diff: (1.1267427956120888, 3.8362629765943406)
Our Monte Carlo estimate:
t: 3.7462302700814556
p_value_mc: 0.0009
ci_mc: (1.1329486494524064, 3.830057122754023)
10) Pitfalls & diagnostics#
Forgetting the pairing
Don’t reorder one vector and not the other.
Pairing mistakes can completely change results.
Interpreting “non-significant” as “no effect”
A large p-value means the data are not very incompatible with \(H_0\).
It does not prove \(H_0\).
Ignoring effect size
Always report \(\bar d\) and a CI.
Small effects can be statistically significant with large \(n\).
Outliers in differences
Check the difference distribution.
Consider robust alternatives if needed.
Multiple testing
If you run many tests, adjust for multiplicity (or use hierarchical modeling).
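As a minimal NumPy-only sketch of one simple adjustment (Bonferroni; the p-values below are hypothetical, and Holm or FDR procedures are often preferable):
p_values = np.array([0.001, 0.02, 0.04, 0.30])  # hypothetical p-values from m tests
m = p_values.size
print(np.minimum(p_values * m, 1.0))            # Bonferroni-adjusted p-values, capped at 1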
Exercises#
Change true_effect and the noise level in the simulation. How do \(t\), the p-value, and the CI respond?
Increase n. When do the Monte Carlo and SciPy p-values become almost identical?
Replace the Gaussian noise with a heavy-tailed distribution (e.g., Student t with small df). What breaks first?
Implement a bootstrap CI for \(\mu_d\) using only NumPy, then compare it to the t-based CI.